Introduction

This assignment focuses on the Kaggle Forest Cover Type Prediction competition: https://www.kaggle.com/c/forest-cover-type-prediction/overview. The task poses a classification problem: predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data).

The study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. Each observation is a 30m x 30m patch. You are asked to predict an integer classification for the forest cover type. The seven types are:

  1. Spruce/Fir
  2. Lodgepole Pine
  3. Ponderosa Pine
  4. Cottonwood/Willow
  5. Aspen
  6. Douglas-fir
  7. Krummholz

The training set (15120 observations) contains both features and the Cover_Type. The test set contains only the features.

You must predict the Cover_Type for every row in the test set (565892 observations).

How to work on the assignment

I expect 3 files from each group:

No group presentation is expected for this assignment.

0. Understanding the Problem / Data

Our project addresses a multi-class classification problem: we need to predict the cover type for every observation in the test set. Our goal is to find the algorithm that predicts the correct cover types with the highest possible accuracy.

1. Data Collection, understanding the dataset.

Our dataset consists of 15120 observations from the 4 wilderness areas in the Roosevelt National Forest in Northern Colorado. Each observation is a 30m x 30m patch classified with one of seven different cover types, which will be our target variable.

2. Initial Analysis.

2.1 Pandas Profiling

Pandas profiling is a great tool for getting an initial overview of the dataset, as it provides many different insights in just a couple of lines of code.

2.2 Null Values

We will start by looking at some basic information about the dataset, as well as if there are any null values within it.
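As a sketch of this step, the basic checks can be done with pandas. The tiny DataFrame below is a hypothetical stand-in for the Kaggle training CSV (the real one has 15120 rows); the column names match real columns from the dataset.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the Kaggle training CSV;
# the real training set has 15120 observations.
df = pd.DataFrame({
    "Elevation": [2596, 2590, 2804],
    "Slope": [3, 2, 9],
    "Cover_Type": [5, 5, 2],
})

df.info()                      # dtypes and non-null counts per column
null_counts = df.isnull().sum()
print(null_counts)             # all zeros: no missing values
```

On the actual competition data this check confirms there are no null values to impute.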

2.3 Assessing the Distribution

2.4 Analysis Summary

After observing the distribution of each variable, we can see that most of them are already one-hot encoded ("dummified"). Only 10 of our columns are actually continuous variables; the rest are binary categorical indicators. Some of the categorical variables, such as 'Soil_Type7' and 'Soil_Type15', contain only a single value, so it might be a good idea to drop them to reduce noise in the dataset.
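A minimal sketch of detecting and dropping the single-valued columns, using a tiny hypothetical frame in place of the real data:

```python
import pandas as pd

# Toy frame: Soil_Type7 and Soil_Type15 carry a single value,
# mirroring what the real dataset shows for those columns.
df = pd.DataFrame({
    "Soil_Type7": [0, 0, 0, 0],
    "Soil_Type15": [0, 0, 0, 0],
    "Elevation": [2596, 2590, 2804, 2785],
})

# A column with one unique value carries no information for the model
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)
print(constant_cols)  # ['Soil_Type7', 'Soil_Type15']
```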

We can also see that our target variable is fairly evenly distributed across the seven classes, so we can treat the dataset as balanced.

3. Data Cleaning and Preparation

3.1 Splitting the Dataset

In order to avoid data leakage, we will split the dataset into our training and test sets before performing any transformations.
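A sketch of the split with `train_test_split`, using synthetic stand-in arrays; stratifying on the target keeps the seven classes balanced in both halves.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(98, 5))       # stand-in features
y = np.tile(np.arange(1, 8), 14)   # balanced stand-in labels 1..7

# stratify=y preserves each cover type's share in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

Any fitted transformers (scalers, selectors, PCA) are then learned on `X_train` only and merely applied to `X_test`.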

3.2 Creating a Baseline Model

We will first create a baseline model (a Decision Tree Classifier) against which we will compare the different transformations applied sequentially to the data, keeping those that improve our classification score.

We use the 'macro' average so that the metric gives the same weight to each class, which is appropriate since the dataset is balanced.
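A minimal sketch of the baseline, substituting `make_classification` with 7 balanced classes for the real data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cover-type data: 7 balanced classes
X, y = make_classification(
    n_samples=700, n_features=10, n_informative=6,
    n_classes=7, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

baseline = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
score = f1_score(y_test, baseline.predict(X_test), average="macro")
print(f"Baseline macro F1: {score:.3f}")
```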

3.3 Initial Pipeline Creation

3.4 Outlier Detection

In order to decide whether we should remove outliers, we will train the baseline decision tree, calculate its F1 score, and compare it against the scores obtained after applying 4 different outlier detection algorithms, to determine which (if any) we should use.

Given that none of the 4 outlier detection methods improved our model and they only use up processing power, we will keep the outliers in the dataset.
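The report does not list which four detectors were tried; IsolationForest is one common choice, so the comparison step might be sketched like this, on stand-in data. Rows flagged as outliers are dropped from the training split only, and the baseline score is recomputed on the cleaned data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))  # stand-in training features

# Flag roughly 5% of rows as outliers; fit_predict returns +1 for
# inliers and -1 for outliers
iso = IsolationForest(contamination=0.05, random_state=42)
inlier_mask = iso.fit_predict(X_train) == 1
X_clean = X_train[inlier_mask]
print(X_train.shape[0] - X_clean.shape[0], "rows flagged as outliers")
```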

4. Feature Engineering

We will start by creating new features from different combinations of the numerical features: addition, subtraction, and the mean of pairs of variables.

Afterwards, we will generate more features using PolynomialFeatures and an exponential transformer.
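A small sketch of both ideas, using two real column names from the dataset on a hypothetical mini-frame; the derived-column names are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({
    "Elevation": [2596.0, 2590.0, 2804.0],
    "Slope": [3.0, 2.0, 9.0],
})

# Pairwise combinations: sum, difference, and mean of two numeric columns
df["Elev_plus_Slope"] = df["Elevation"] + df["Slope"]
df["Elev_minus_Slope"] = df["Elevation"] - df["Slope"]
df["Elev_Slope_mean"] = (df["Elevation"] + df["Slope"]) / 2

# Degree-2 polynomial expansion of the original numeric columns
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(df[["Elevation", "Slope"]])
print(expanded.shape)  # (3, 5): x1, x2, x1^2, x1*x2, x2^2
```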

Transformed Features Pipeline

5. Feature Selection

In order to reduce the complexity of our model and mitigate the curse of dimensionality, we will select features with a filter method, SelectFpr, which drops features whose p-values exceed a chosen false-positive-rate threshold.

Afterwards, we will use Principal Component Analysis (PCA) to capture most of the variance in just 50 components, which shortens training time during cross-validation.
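A sketch of the two-stage reduction on synthetic data; `n_components` is set to 10 here because the toy problem has few informative features, whereas the report keeps 50 components on its much larger engineered feature set.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFpr, f_classif
from sklearn.pipeline import Pipeline

# 80 features, of which only a handful are informative
X, y = make_classification(n_samples=300, n_features=80,
                           n_informative=10, random_state=42)

# SelectFpr keeps features whose ANOVA p-value is below alpha,
# then PCA compresses the survivors into a few components
selector = Pipeline([
    ("fpr", SelectFpr(f_classif, alpha=0.05)),
    ("pca", PCA(n_components=10)),
])
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)
```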

6. Model Selection

The ExtraTreesClassifier seems to perform best against the training set. It might also be a good idea to "boost" this classifier, using smaller extra-trees models as the base estimator of an AdaBoostClassifier.

7. Hyperparameter Tuning

After choosing the best model, we will tune its hyperparameters with randomized search cross-validation over different combinations. In the end, the model's simplicity allowed us to try all combinations exhaustively.
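A sketch of the randomized search on stand-in data; the search space below is hypothetical, as the report's actual grid is not shown here. `n_iter` controls how many random combinations are sampled.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=15,
                           n_informative=6, n_classes=7, random_state=42)

# Hypothetical search space for illustration
param_dist = {
    "n_estimators": [50, 100, 200],
    "max_features": [0.4, 0.6, 0.8, None],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(
    ExtraTreesClassifier(random_state=42),
    param_dist, n_iter=5, cv=3, scoring="accuracy",
    random_state=42, n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```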

GridSearchCV code

If we wanted to try out more combinations we could use the following code for Grid-searching.

import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf0 = RandomForestClassifier(n_jobs=-1)

# param_grid: the dictionary of candidate hyperparameter values
grid = GridSearchCV(
    rf0,            # you can pass a pipeline here too!
    param_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=-1,
    verbose=2,
)

start_time = time.time()
grid.fit(X_train_features, y_train)
print("--- %s seconds ---" % (time.time() - start_time))

grid.best_params_

Using the Hyperparameters in the Data without PCA

PCA is useful for reducing the complexity of the model (especially when cross-validating, as this step always takes a long time), but training on the dataset without PCA yields better results.

import time

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.pipeline import Pipeline

clf = ExtraTreesClassifier(
    n_jobs=-1,
    n_estimators=100,
    min_samples_split=2,
    min_samples_leaf=1,
    max_samples=None,
    max_features=0.6,
    max_depth=None,
    bootstrap=False,
    verbose=5,
)

start_time = time.time()

pipe2 = Pipeline([('Classifier', clf)])

pipe2.fit(X_train_features, y_train)
print(f"{clf}, {pipe2.score(X_test_features, y_test)}")
print("--- %s seconds ---" % (time.time() - start_time))

8. Training the model on the whole dataset

After model selection, we will train our final classifier on the whole dataset. The more data we use, the more likely the model is to generalise well, so this should give us the best possible model.

9. Predict the Test

Now all that's left is predicting the target variable!
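A sketch of the final step on stand-in arrays: refit the tuned classifier on the full training data, predict the Kaggle test features, and write the two-column submission file (`Id`, `Cover_Type`). The ids here are made up; the real ones come from the test CSV.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
X_full = rng.normal(size=(100, 5))   # stand-in for the full training set
y_full = rng.integers(1, 8, size=100)
X_kaggle = rng.normal(size=(30, 5))  # stand-in for the Kaggle test set

# Refit the final model on everything, then predict the unseen rows
clf = ExtraTreesClassifier(n_estimators=50, random_state=42).fit(X_full, y_full)
submission = pd.DataFrame({
    "Id": np.arange(1, len(X_kaggle) + 1),  # hypothetical ids
    "Cover_Type": clf.predict(X_kaggle),
})
submission.to_csv("submission.csv", index=False)
print(submission.head())
```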